Abstract

This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO)
with noisy function evaluations, exposing a fundamental and unavoidable gap between the
performance of algorithms with access to gradients and those with access to only function
evaluations. However, there are situations in which DFO is unavoidable, and for such situations
we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex
objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued
function comparisons, rather than function evaluations. This makes the algorithm useful in an
even wider range of applications, such as optimization based on paired comparisons from human
subjects, for example. We also show that regardless of whether DFO is based on noisy function
evaluations or Boolean-valued function comparisons, the convergence rate is the same.

1 Introduction

Optimizing large-scale complex systems often requires the tuning of many parameters. With
training data or simulations one can evaluate the relative merit, or incurred loss, of different
parameter settings, but it may be unclear how each parameter influences the overall objective
function. In such cases, derivatives of the objective function with respect to the parameters are
unavailable. Thus, we have seen a resurgence of interest in Derivative Free Optimization (DFO)
[1][2][3][4][5][6][7][8]. When function evaluations are noiseless, DFO methods can achieve the
same rates of convergence as noiseless gradient methods up to a small factor depending on a
low-order polynomial of the dimension [9][5][10]. This leads one to wonder if the same equivalence
can be extended to the case when function evaluations and gradients are noisy.

Sadly, this paper proves otherwise. We show that when function evaluations are noisy, the
optimization error of any DFO is $\Omega(\sqrt{1/T})$, where $T$ is the number of evaluations. This
lower bound holds even for strongly convex functions. In contrast, noisy gradient methods exhibit
$\Theta(1/T)$ error scaling for strongly convex functions [9][11]. A consequence of our theory is that
finite differencing cannot achieve the rates of gradient methods when the function evaluations are
noisy.

On the positive side, we also present a new derivative-free algorithm that achieves this lower bound 
with near optimal dimension dependence. Moreover, the algorithm uses only boolean comparisons 
of function values, not actual function values. This makes the algorithm applicable to situations in 
which the optimization is only able to probably correctly decide if the value of one configuration is 
better than the value of another. This is especially interesting in optimization based on human subject 
feedback, where paired comparisons are often used instead of numerical scoring. The convergence 
rate of the new algorithm is optimal in terms of T and near-optimal in terms of its dependence 
on the ambient dimension. Surprisingly, our lower bounds show that this new algorithm that uses 
only function comparisons achieves the same rate in terms of T as any algorithm that has access to 
function evaluations.

2 Problem formulation and background

We now formalize the notation and conventions for our analysis of DFO. A function $f$ is strongly
convex with constant $\tau$ on a convex set $\mathcal{B} \subset \mathbb{R}^n$ if there exists a constant $\tau > 0$ such that

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\tau}{2}\|x - y\|^2$$

for all $x, y \in \mathcal{B}$. The gradient of $f$, if it exists, denoted $\nabla f$, is Lipschitz with constant $L$ if
$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for some $L > 0$. The class of strongly convex functions with
Lipschitz gradients defined on a nonempty, convex set $\mathcal{B} \subset \mathbb{R}^n$ which take their minimum in $\mathcal{B}$
with parameters $\tau$ and $L$ is denoted by $\mathcal{F}_{\tau,L,\mathcal{B}}$.

The problem we consider is minimizing a function $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$. The function $f$ is not explicitly
known. An optimization procedure may only query the function in one of the following two ways.

Function Evaluation Oracle: For any point $x \in \mathcal{B}$ an optimization procedure can observe

$$E_f(x) = f(x) + w$$

where $w \in \mathbb{R}$ is a random variable with $\mathbb{E}[w] = 0$ and $\mathbb{E}[w^2] = \sigma^2$.

Function Comparison Oracle: For any pair of points $x, y \in \mathcal{B}$ an optimization procedure can
observe a binary random variable $C_f(x, y)$ satisfying

$$P\left(C_f(x, y) = \mathrm{sign}\{f(y) - f(x)\}\right) \geq \frac{1}{2} + \min\left\{\delta_0, \mu|f(y) - f(x)|^{\kappa-1}\right\} \qquad (1)$$

for some $0 < \delta_0 \leq 1/2$, $\mu > 0$ and $\kappa \geq 1$. When $\kappa = 1$, without loss of generality
assume $\mu \leq \delta_0 \leq 1/2$. Note $\kappa = 1$ implies that the comparison oracle is correct with
a probability that is greater than 1/2 and independent of $x, y$. If $\kappa > 1$, then the oracle's
reliability decreases as the difference between $f(x)$ and $f(y)$ decreases.
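
To make the comparison model concrete, the following is a minimal simulation sketch of an oracle that meets condition (1) with equality. The function names `comparison_oracle` and `empirical_accuracy`, the convention that a zero gap has sign +1, and the test values are illustrative assumptions, not part of the paper.

```python
import random


def comparison_oracle(f, x, y, kappa, mu, delta0, rng):
    """Return +1 or -1. The answer equals sign(f(y) - f(x)) with probability
    1/2 + min(delta0, mu * |f(y) - f(x)|**(kappa - 1)), i.e. model (1) with equality.
    Assumption: a zero gap is treated as having sign +1."""
    gap = f(y) - f(x)
    true_sign = 1 if gap >= 0 else -1
    if kappa == 1:
        advantage = min(delta0, mu)  # reliability does not depend on the gap
    else:
        advantage = min(delta0, mu * abs(gap) ** (kappa - 1))
    p_correct = 0.5 + advantage
    return true_sign if rng.random() < p_correct else -true_sign


def empirical_accuracy(f, x, y, kappa, mu, delta0, trials, rng):
    """Fraction of oracle calls that return the true sign of f(y) - f(x)."""
    true_sign = 1 if f(y) - f(x) >= 0 else -1
    hits = sum(
        comparison_oracle(f, x, y, kappa, mu, delta0, rng) == true_sign
        for _ in range(trials)
    )
    return hits / trials
```

With $\kappa = 2$ the advantage over a coin flip shrinks linearly as the two function values approach each other, which is the regime that makes the line search of Section 5 delicate.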

To illustrate how the function comparison oracle and function evaluation oracles relate to each other,
suppose $C_f(x, y) = \mathrm{sign}\{E_f(y) - E_f(x)\}$ where $E_f(x)$ is a function evaluation oracle with
additive noise $w$. If $w$ is Gaussian distributed with mean zero and variance $\sigma^2$ then $\kappa = 2$ and
$\mu \geq (4\pi\sigma^2 e)^{-1/2}$ (see Appendix A). In fact, this choice of $w$ corresponds to Thurstone's law of
comparative judgment which is a popular model for outcomes of pairwise comparisons from human
subjects [12]. If $w$ is a "spikier" distribution such as a two-sided Gamma distribution with shape
parameter in the range of $(0, 1]$ then all values of $\kappa \in (1, 2]$ can be realized (see Appendix A).
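
The Gaussian case can be checked numerically. The sketch below builds a comparison from two noisy evaluations and compares its exact accuracy with the bound $1/2 + \min\{\delta_0, \mu|f(y) - f(x)|\}$, taking $\mu = (4\pi\sigma^2 e)^{-1/2}$ and, as an assumption drawn from the density bound used in Appendix A, $\delta_0 = (2\pi e)^{-1/2}$. All function names are illustrative.

```python
import math
import random


def gaussian_comparison(f, x, y, sigma, rng):
    """Comparison built from two independent noisy evaluations E_f = f + w, w ~ N(0, sigma^2)."""
    ef_x = f(x) + rng.gauss(0.0, sigma)
    ef_y = f(y) + rng.gauss(0.0, sigma)
    return 1 if ef_y - ef_x >= 0 else -1


def exact_accuracy(gap, sigma):
    """P(Z + |gap| / sqrt(2 sigma^2) >= 0) for a standard normal Z."""
    t = abs(gap) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gaussian_lower_bound(gap, sigma):
    """1/2 + min(delta0, mu * |gap|) with the constants assumed in the lead-in."""
    delta0 = 1.0 / math.sqrt(2.0 * math.pi * math.e)
    mu = 1.0 / math.sqrt(4.0 * math.pi * sigma ** 2 * math.e)
    return 0.5 + min(delta0, mu * abs(gap))
```

The bound is deliberately loose: it only needs to exhibit the linear dependence on the gap that corresponds to $\kappa = 2$.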

Interest in the function comparison oracle is motivated by certain popular derivative-free optimiza- 
tion procedures that use only comparisons of function evaluations (e.g. Q) and by optimization 
problems involving human subjects making paired comparisons (for instance, getting fitted for pre- 
scription lenses or a hearing aid where unknown parameters specific to each person are tuned with 
the familiar queries "better or worse?"). Pairwise comparisons have also been suggested as a novel
way to tune web-search algorithms [13]. Pairwise comparison strategies have previously been
analyzed in the finite setting where the task is to identify the best alternative among a finite set of
alternatives (sometimes referred to as the dueling-bandit problem) [13][14]. The function
comparison oracle presented in this work and its analysis are novel. The main contributions of this
work are as follows: (i) lower bounds for the function evaluation oracle in the presence of
measurement noise, (ii) lower bounds for the function comparison oracle in the presence of noise,
and (iii) an algorithm for the function comparison oracle, which can also be applied to the function
evaluation oracle setting, that nearly matches both the lower bounds of (i) and (ii).

We prove our lower bounds for strongly convex functions with Lipschitz gradients defined on a
compact, convex set $\mathcal{B}$, and because these problems are a subset of those involving all convex
functions (and have non-empty intersection with problems where $f$ is merely Lipschitz), the lower
bound also applies to these larger classes. While there are known theoretical results for DFO in the
noiseless setting [15][5][10], to the best of our knowledge we are the first to characterize lower
bounds for DFO in the stochastic setting. Moreover, we believe we are the first to show a novel
upper bound for stochastic DFO using a function comparison oracle (which also applies to the
function evaluation oracle). However, there are algorithms with upper bounds on the rates of
convergence for stochastic DFO with the function evaluation oracle [15][16]. We discuss the
relevant results in the next section following the lower bounds.

While there remain many open problems in stochastic DFO (see Section 6), rates of convergence
with a stochastic gradient oracle are well known and were first lower bounded by Nemirovski and
Yudin [15]. These classic results were recently tightened to show a dependence on the dimension
of the problem [17], and then tightened again to show a better dependence on the noise [11], which
matches the upper bound achieved by stochastic gradient descent [9]. The aim of this work is to
start filling in the knowledge gaps of stochastic DFO so that it is as well understood as the stochastic
gradient oracle. Our bounds are based on simple techniques borrowed from the statistical learning
literature that use natural functions and oracles in the same spirit as [17].



3 Main results 



The results below are presented with simplifying constants that encompass many factors to aid in
exposition. Explicit constants are given in the proofs in Sections 4 and 5. Throughout, we denote
the minimizer of $f$ as $x_f^*$. The expectation in the bounds is with respect to the noise in the oracle
queries and (possible) optimization algorithm randomization.



3.1 Query complexity of the function comparison oracle 

Theorem 1. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ let $C_f$ be a function comparison oracle with parameters
$(\kappa, \mu, \delta_0)$. Then for $n \geq 8$ and sufficiently large $T$

$$\inf_{\hat{x}_T} \sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}} \mathbb{E}\left[f(\hat{x}_T) - f(x_f^*)\right] \geq
\begin{cases}
c_1 \exp\left\{-c_2 \frac{T}{n}\right\} & \text{if } \kappa = 1 \\
c_3 \left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}} & \text{if } \kappa > 1
\end{cases}$$

where the infimum is over the collection of all possible estimators of $x_f^*$ using at most $T$ queries to
a function comparison oracle and the supremum is taken with respect to all problems in $\mathcal{F}_{\tau,L,\mathcal{B}}$ and
function comparison oracles with parameters $(\kappa, \mu, \delta_0)$. The constants $c_1, c_2, c_3$ depend on the
oracle and function class parameters, as well as the geometry of $\mathcal{B}$, but are independent of $T$ and $n$.

For upper bounds we propose a specific algorithm based on coordinate-descent in Section 5 and
prove the following theorem for the case of unconstrained optimization, that is, $\mathcal{B} = \mathbb{R}^n$.

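Theorem 2. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$ let $C_f$ be a function comparison oracle with
parameters $(\kappa, \mu, \delta_0)$. Then there exists a coordinate-descent algorithm that is adaptive to unknown
$\kappa \geq 1$ that outputs an estimate $\hat{x}_T$ after $T$ function comparison queries such that with probability
$1 - \delta$

$$\sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}} \mathbb{E}\left[f(\hat{x}_T) - f(x_f^*)\right] \leq
\begin{cases}
c_1 \exp\left\{-c_2 \sqrt{\frac{T}{n}}\right\} & \text{if } \kappa = 1 \\
c_3\, n \left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}} & \text{if } \kappa > 1
\end{cases}$$

where $c_1, c_2, c_3$ depend on the oracle and function class parameters as well as $T$, $n$, and $1/\delta$, but only
poly-logarithmically.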



3.2 Query complexity of the function evaluation oracle 

Theorem 3. For every $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ let $E_f$ be a function evaluation oracle with variance $\sigma^2$. Then
for $n \geq 8$ and sufficiently large $T$

$$\inf_{\hat{x}_T} \sup_{f \in \mathcal{F}_{\tau,L,\mathcal{B}}} \mathbb{E}\left[f(\hat{x}_T) - f(x_f^*)\right] \geq c \left(\frac{n\sigma^2}{T}\right)^{1/2}$$

where the infimum is taken with respect to the collection of all possible estimators of $x_f^*$ using just
$T$ queries to a function evaluation oracle and the supremum is taken with respect to all problems in
$\mathcal{F}_{\tau,L,\mathcal{B}}$ and function evaluation oracles with variance $\sigma^2$. The constant $c$ depends on the oracle and
function class parameters, as well as the geometry of $\mathcal{B}$, but is independent of $T$ and $n$.

Because a function evaluation oracle can always be turned into a function comparison oracle (see
discussion above), the algorithm and upper bound in Theorem 2 with $\kappa = 2$ apply to many typical
function evaluation oracles (e.g. additive Gaussian noise), yielding an upper bound of $(n^3\sigma^2/T)^{1/2}$
ignoring constants and log factors. This matches the rate of convergence as a function of $T$ and $\sigma^2$,
but has worse dependence on the dimension $n$.

Alternatively, under a less restrictive setting, Nemirovski and Yudin proposed two algorithms for
the class of convex, Lipschitz functions that obtain rates of $n^{1/2}/T^{1/4}$ and $p(n)/T^{1/2}$, respectively,
where $p(n)$ was left as an unspecified polynomial of $n$ [15]. While focusing on stochastic DFO with
bandit feedback, Agarwal et al. built on the ideas developed in [15] to obtain a result that they
point out implies a convergence rate of $n^{16}/T^{1/2}$ in the optimization setting considered here [16].
Whether or not these rates can be improved to those obtained under the more restrictive function
classes above is an open question.

A related but fundamentally different problem is online (or stochastic) convex optimization with
multi-point feedback [18][5][19]. Essentially, this setting allows the algorithm to probe the value
of the function $f$ plus noise at multiple locations where the noise changes at each time step, but each
set of samples at each time experiences the same noise. Because the noise model of that work is
incompatible with the one considered here, no comparisons should be made between the two.



4 Lower Bounds 



The lower bounds in Theorems 1 and 3 are proved using a general minimax bound [20, Thm. 2.5].
Our proofs are most related to the approach developed in [21] for active learning, which like
optimization involves a Markovian sampling process. Roughly speaking, the lower bounds are
established by considering a simple case of the optimization problem in which the global minimum is
known a priori to belong to a finite set. Since the simple case is "easier" than the original
optimization, the minimum number of queries required for a desired level of accuracy in this case
yields a lower bound for the original problem.

The following theorem is used to prove the bounds. In the terms of the theorem, / is a function to 
be minimized and Pf is the probability model governing the noise associated with queries when / 
is the true function. 

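Theorem 4. [20, Thm. 2.5] Consider a class of functions $\mathcal{F}$ and an associated family of probability
measures $\{P_f\}_{f \in \mathcal{F}}$. Let $M \geq 2$ be an integer and $f_0, f_1, \dots, f_M$ be functions in $\mathcal{F}$. Let
$d(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ be a semi-distance and assume that:

1. $d(f_i, f_j) \geq 2s > 0$, for all $0 \leq i < j \leq M$,

2. $\frac{1}{M}\sum_{i=1}^{M} \mathrm{KL}(P_i \| P_0) \leq a \log M$,

where the Kullback-Leibler divergence $\mathrm{KL}(P_i \| P_0) := \int \log\frac{dP_i}{dP_0}\, dP_i$ is assumed to be well-defined
(i.e., $P_0$ is a dominating measure) and $0 < a < 1/8$. Then

$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} P\left(d(\hat{f}, f) \geq s\right) \geq \inf_{\hat{f}} \max_{f \in \{f_0, \dots, f_M\}} P\left(d(\hat{f}, f) \geq s\right) \geq \frac{\sqrt{M}}{1 + \sqrt{M}}\left(1 - 2a - 2\sqrt{\frac{a}{\log M}}\right) > 0$$

where the infimum is taken over all possible estimators based on a sample from $P_f$.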
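
As a quick numerical aid, the sketch below evaluates the right-hand side of the bound, read here as $\frac{\sqrt{M}}{1+\sqrt{M}}\left(1 - 2a - 2\sqrt{a/\log M}\right)$, for the smallest admissible $M$ and for $a = 1/16$, the value chosen later in Section 4.1. The function name is an illustrative assumption.

```python
import math


def minimax_constant(M, a):
    """Right-hand side of the minimax bound for M + 1 hypotheses and KL budget a * log(M).
    Assumes the form sqrt(M)/(1+sqrt(M)) * (1 - 2a - 2*sqrt(a/log M))."""
    root_m = math.sqrt(M)
    return root_m / (1.0 + root_m) * (1.0 - 2.0 * a - 2.0 * math.sqrt(a / math.log(M)))
```

The expression grows with $M$, so its value at $M = 2$ is the worst case once $a$ is fixed; this is where the constant $1/7$ used in the proofs comes from.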

We are concerned with the functions in the class $\mathcal{F} := \mathcal{F}_{\tau,L,\mathcal{B}}$. The volume of $\mathcal{B}$ will affect only
constant factors in our bounds, so we will simply denote the class of functions by $\mathcal{F}$ and refer
explicitly to $\mathcal{B}$ only when necessary. Let $x_f := \arg\min_{x \in \mathcal{B}} f(x)$, for all $f \in \mathcal{F}$. The semi-distance we
use is $d(f, g) := \|x_f - x_g\|$, for all $f, g \in \mathcal{F}$. Note that each point in $\mathcal{B}$ can be specified by one of
many $f \in \mathcal{F}$. So the problem of selecting an $f$ is equivalent to selecting a point $x \in \mathcal{B}$. Indeed, the
semi-distance defines a collection of equivalence classes in $\mathcal{F}$ (i.e., all functions having a minimum
at $x \in \mathcal{B}$ are equivalent). For every $f \in \mathcal{F}$ we have $\inf_{g \in \mathcal{F}} f(x_g) = \inf_{x \in \mathcal{B}} f(x)$, which is a useful
identity to keep in mind.

We now construct the functions $f_0, f_1, \dots, f_M$ that will be used for our proofs. Let $\Omega = \{-1, 1\}^n$ so
that each $\omega \in \Omega$ is a vertex of the $n$-dimensional hypercube. Let $\mathcal{V} \subset \Omega$ with cardinality $|\mathcal{V}| \geq 2^{n/8}$
such that for all $\omega \neq \omega' \in \mathcal{V}$, we have $\rho(\omega, \omega') \geq n/8$ where $\rho(\cdot, \cdot)$ is the Hamming distance. It is
known that such a set exists by the Varshamov-Gilbert bound [20, Lemma 2.9]. Denote the elements
of $\mathcal{V}$ by $\omega_0, \omega_1, \dots, \omega_M$. Next we state some elementary bounds on the functions that will be used
in our analysis.

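Lemma 1. For $\epsilon > 0$ define the set $\mathcal{B} \subset \mathbb{R}^n$ to be the $\ell_\infty$ ball of radius $\epsilon$ and define the functions
on $\mathcal{B}$: $f_i(x) := \frac{\tau}{2}\|x - \epsilon\omega_i\|^2$, for $i = 0, \dots, M$, $\omega_i \in \mathcal{V}$, and $x_i := \arg\min_x f_i(x) = \epsilon\omega_i$. Then
for all $0 \leq i < j \leq M$ and $x \in \mathcal{B}$ the functions $f_i(x)$ satisfy

1. $f_i$ is strongly convex-$\tau$ with Lipschitz-$L$ gradients and $x_i \in \mathcal{B}$

2. $\|x_i - x_j\| \geq \epsilon\sqrt{n/2}$

3. $|f_i(x) - f_j(x)| \leq 2\tau n \epsilon^2$.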
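
The construction can be exercised numerically. The sketch below assumes the hypotheses have the form $f_i(x) = \frac{\tau}{2}\|x - \epsilon\omega_i\|^2$ on the $\ell_\infty$ ball of radius $\epsilon$ (so that $x_i = \epsilon\omega_i$ lies in the set), draws a packing of the hypercube by random sampling with rejection rather than by the Varshamov-Gilbert argument, and checks the second and third claims. All names and the sampling strategy are illustrative assumptions.

```python
import itertools
import math
import random


def hamming(u, v):
    """Number of coordinates in which two sign vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def sample_packing(n, size, min_dist, rng):
    """Draw random vertices of {-1,+1}^n, keeping those at Hamming distance
    >= min_dist from every vertex kept so far, until `size` are collected."""
    chosen = []
    while len(chosen) < size:
        omega = tuple(rng.choice((-1, 1)) for _ in range(n))
        if all(hamming(omega, v) >= min_dist for v in chosen):
            chosen.append(omega)
    return chosen


def make_f(omega, eps, tau):
    """Return (f, minimizer) for the assumed form f(x) = tau/2 * ||x - eps*omega||^2."""
    center = [eps * w for w in omega]

    def f(x):
        return 0.5 * tau * sum((a - c) ** 2 for a, c in zip(x, center))

    return f, center


def check_claims(n, eps, tau, rng):
    """Check claims 2 and 3 on every pair of hypotheses at a random point of the ball."""
    size = math.ceil(2 ** (n / 8))
    packing = sample_packing(n, size, math.ceil(n / 8), rng)
    hyps = [make_f(w, eps, tau) for w in packing]
    ok = True
    for (fi, xi), (fj, xj) in itertools.combinations(hyps, 2):
        x = [rng.uniform(-eps, eps) for _ in range(n)]  # point in the l_inf ball
        ok = ok and math.dist(xi, xj) >= eps * math.sqrt(n / 2) - 1e-12
        ok = ok and abs(fi(x) - fj(x)) <= 2 * tau * n * eps ** 2 + 1e-12
    return len(packing), ok
```

The hypotheses are well separated in their minimizers (claim 2) yet nearly indistinguishable in value (claim 3), which is what makes them hard to tell apart from noisy queries.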

We are now ready to prove Theorems 1 and 3. Each proof uses the functions $f_0, \dots, f_M$ a bit
differently, and since the noise model is also different in each case, the KL divergence is bounded
differently in each proof. We use the fact that if $X$ and $Y$ are random variables distributed according
to Bernoulli distributions $P_X$ and $P_Y$ with parameters $1/2 + \mu$ and $1/2 - \mu$, then
$\mathrm{KL}(P_X \| P_Y) \leq 4\mu^2/(1/2 - \mu)$. Also, if $X \sim \mathcal{N}(\mu_X, \sigma^2) =: P_X$ and $Y \sim \mathcal{N}(\mu_Y, \sigma^2) =: P_Y$ then
$\mathrm{KL}(P_X \| P_Y) = \frac{1}{2\sigma^2}(\mu_X - \mu_Y)^2$.
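
Both divergence facts are easy to verify numerically. The sketch below computes the exact Bernoulli divergence and compares it with the bound $4\mu^2/(1/2 - \mu)$, and evaluates the closed form for two Gaussians with a common variance. The function names are illustrative.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def bernoulli_kl_bound(mu):
    """Upper bound 4 mu^2 / (1/2 - mu) on KL(Bernoulli(1/2 + mu) || Bernoulli(1/2 - mu))."""
    return 4.0 * mu ** 2 / (0.5 - mu)


def kl_gaussian_same_var(mean_x, mean_y, sigma):
    """KL divergence between N(mean_x, sigma^2) and N(mean_y, sigma^2)."""
    return (mean_x - mean_y) ** 2 / (2.0 * sigma ** 2)
```

The Bernoulli bound follows from $\log(1 + u) \leq u$ applied to the log-ratio of the two success probabilities.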

4.1 Proof of Theorem 1

First we will obtain the bound for the case $\kappa > 1$. Let the comparison oracle satisfy

$$P\left(C_{f_i}(x, y) = \mathrm{sign}\{f_i(y) - f_i(x)\}\right) = \frac{1}{2} + \min\left\{\mu|f_i(y) - f_i(x)|^{\kappa-1}, \delta_0\right\}.$$

In words, $C_{f_i}(x, y)$ is correct with probability as large as the right-hand side above, and this
probability is monotonically increasing in $|f_i(y) - f_i(x)|$. Let $\{x_k, y_k\}_{k=1}^{T}$ be a sequence of $T$ pairs in $\mathcal{B}$ and let
$\{C_{f_i}(x_k, y_k)\}_{k=1}^{T}$ be the corresponding sequence of noisy comparisons. We allow the sequence
$\{x_k, y_k\}_{k=1}^{T}$ to be generated in any way subject to the Markovian assumption that $C_{f_i}(x_k, y_k)$ given
$(x_k, y_k)$ is conditionally independent of $\{x_i, y_i\}_{i<k}$. For $i = 0, \dots, M$, and $\ell = 1, \dots, T$ let $P_{i,\ell}$
denote the joint probability distribution of $\{x_k, y_k, C_{f_i}(x_k, y_k)\}_{k=1}^{\ell}$, let $Q_{i,\ell}$ denote the conditional
distribution of $C_{f_i}(x_\ell, y_\ell)$ given $(x_\ell, y_\ell)$, and let $S_\ell$ denote the conditional distribution of $(x_\ell, y_\ell)$
given $\{x_k, y_k, C_{f_i}(x_k, y_k)\}_{k=1}^{\ell-1}$. Note that $S_\ell$ is only a function of the underlying optimization
algorithm and does not depend on $i$.



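$$\begin{aligned}
\mathrm{KL}(P_{i,T} \| P_{j,T}) &= \mathbb{E}_{P_{i,T}}\left[\log\frac{P_{i,T}}{P_{j,T}}\right]
= \mathbb{E}_{P_{i,T}}\left[\log\frac{\prod_{\ell=1}^{T} Q_{i,\ell} S_\ell}{\prod_{\ell=1}^{T} Q_{j,\ell} S_\ell}\right]
= \mathbb{E}_{P_{i,T}}\left[\log\frac{\prod_{\ell=1}^{T} Q_{i,\ell}}{\prod_{\ell=1}^{T} Q_{j,\ell}}\right] \\
&= \sum_{\ell=1}^{T} \mathbb{E}_{P_{i,T}}\left[\mathbb{E}_{P_{i,T}}\left[\log\frac{Q_{i,\ell}}{Q_{j,\ell}} \,\Big|\, \{x_k, y_k\}_{k=1}^{T}\right]\right]
\leq T \sup_{x_1, y_1 \in \mathcal{B}} \mathbb{E}_{P_{i,1}}\left[\mathbb{E}_{P_{i,1}}\left[\log\frac{Q_{i,1}}{Q_{j,1}} \,\Big|\, x_1, y_1\right]\right].
\end{aligned}$$
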

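By the third claim of Lemma 1, $|f_i(x) - f_j(x)| \leq 2\tau n \epsilon^2$, and therefore the bound above is
less than or equal to the KL divergence between the Bernoulli distributions with parameters
$\frac{1}{2} \pm \mu(2\tau n \epsilon^2)^{\kappa-1}$, yielding the bound

$$\mathrm{KL}(P_{i,T} \| P_{j,T}) \leq \frac{4T\mu^2(2\tau n \epsilon^2)^{2(\kappa-1)}}{1/2 - \mu(2\tau n \epsilon^2)^{\kappa-1}} \leq 16T\mu^2(2\tau n \epsilon^2)^{2(\kappa-1)}$$

provided $\epsilon$ is sufficiently small. We also assume $\epsilon$ (or, equivalently, $\mathcal{B}$) is sufficiently small so that
$\mu|f_i(x) - f_j(x)|^{\kappa-1} \leq \delta_0$. We are now ready to apply Theorem 4. Recalling that $M \geq 2^{n/8}$, we
want to choose $\epsilon$ such that

$$\mathrm{KL}(P_{i,T} \| P_{j,T}) \leq 16T\mu^2(2\tau n \epsilon^2)^{2(\kappa-1)} \leq a\,\frac{n}{8}\log(2) \leq a \log M$$

with an $a$ small enough so that we can apply the theorem. By setting $a = 1/16$ and equating the two
sides of the equation we have

$$\epsilon = \epsilon_T := \frac{1}{2\sqrt{n}}\left(\frac{2}{\tau}\right)^{1/2}\left(\frac{n\log(2)}{2048\mu^2 T}\right)^{\frac{1}{4(\kappa-1)}}$$

(note that this also implies a sequence of sets $\mathcal{B}_T$ by the definition of the functions in Lemma 1).
Thus, the semi-distance satisfies

$$d(f_j, f_i) = \|x_j - x_i\| \geq \sqrt{n/2}\,\epsilon_T = \frac{1}{2\sqrt{2}}\left(\frac{2}{\tau}\right)^{1/2}\left(\frac{n\log(2)}{2048\mu^2 T}\right)^{\frac{1}{4(\kappa-1)}} =: 2s_T.$$

Applying Theorem 4 we have

$$\begin{aligned}
\inf_{\hat{f}} \sup_{f \in \mathcal{F}} P\left(\|x_{\hat{f}} - x_f\| \geq s_T\right)
&\geq \inf_{\hat{f}} \max_{i \in \{0, \dots, M\}} P\left(\|x_{\hat{f}} - x_i\| \geq s_T\right)
= \inf_{\hat{f}} \max_{i \in \{0, \dots, M\}} P\left(d(\hat{f}, f_i) \geq s_T\right) \\
&\geq \frac{\sqrt{M}}{1 + \sqrt{M}}\left(1 - 2a - 2\sqrt{\frac{a}{\log M}}\right) > 1/7,
\end{aligned}$$

where the final inequality holds since $M \geq 2$ and $a = 1/16$. Strong convexity implies that
$f(x) - f(x_f) \geq \frac{\tau}{2}\|x - x_f\|^2$ for all $f \in \mathcal{F}$ and $x \in \mathcal{B}$. Therefore

$$\begin{aligned}
\inf_{\hat{f}} \sup_{f \in \mathcal{F}} P\left(f(x_{\hat{f}}) - f(x_f) \geq \frac{\tau}{2}s_T^2\right)
&\geq \inf_{\hat{f}} \max_{i \in \{0, \dots, M\}} P\left(f_i(x_{\hat{f}}) - f_i(x_i) \geq \frac{\tau}{2}s_T^2\right) \\
&\geq \inf_{\hat{f}} \max_{i \in \{0, \dots, M\}} P\left(\frac{\tau}{2}\|x_{\hat{f}} - x_i\|^2 \geq \frac{\tau}{2}s_T^2\right) \\
&= \inf_{\hat{f}} \max_{i \in \{0, \dots, M\}} P\left(\|x_{\hat{f}} - x_i\| \geq s_T\right) > 1/7.
\end{aligned}$$

Finally, applying Markov's inequality we have

$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\left[f(x_{\hat{f}}) - f(x_f)\right] \geq \frac{1}{7}\left(\frac{1}{32}\right)\left(\frac{n\log(2)}{2048\mu^2 T}\right)^{\frac{1}{2(\kappa-1)}}.$$
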



4.2 Proof of Theorem 1 for $\kappa = 1$

To handle the case when $\kappa = 1$ we use functions of the same form, but the construction is slightly
different. Let $\ell$ be a positive integer and let $M = \ell^n$. Let $\{\xi_i\}_{i=1}^{M}$ be a set of uniformly spaced points
in $\mathcal{B}$, which we define to be the unit cube in $\mathbb{R}^n$, so that $\|\xi_i - \xi_j\| \geq \ell^{-1}$ for all $i \neq j$. Define
$f_i(x) := \frac{\tau}{2}\|x - \xi_i\|^2$, $i = 1, \dots, M$. Let $s := \frac{1}{2\ell}$ so that $d(f_i, f_j) := \|x_i^* - x_j^*\| \geq 2s$. Because
$\kappa = 1$, we have $P\left(C_{f_i}(x, y) = \mathrm{sign}\{f_i(y) - f_i(x)\}\right) \geq 1/2 + \mu$ for some $\mu > 0$, all $i \in \{1, \dots, M\}$, and
all $x, y \in \mathcal{B}$. We bound $\mathrm{KL}(P_{i,T} \| P_{j,T})$ in exactly the same way as we bounded it in Section 4.1
except that now we have $C_{f_i}(x_k, y_k) \sim \mathrm{Bernoulli}(\frac{1}{2} + \mu)$ and $C_{f_j}(x_k, y_k) \sim \mathrm{Bernoulli}(\frac{1}{2} - \mu)$. It
then follows that if we wish to apply the theorem, we want to choose $s$ so that

$$\mathrm{KL}(P_{i,T} \| P_{j,T}) \leq \frac{4T\mu^2}{1/2 - \mu} \leq a \log M = a\,n \log\left(\frac{1}{2s}\right)$$

for some $a < 1/8$. Using the same sequence of steps as in Section 4.1 (with $a = 1/16$) we have

$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\left[f(x_{\hat{f}}) - f(x_f)\right] \geq \frac{1}{7}\,\frac{\tau}{2}\left(\frac{1}{2}\right)^2 \exp\left\{-\frac{128\,T\mu^2}{n(1/2 - \mu)}\right\}.$$


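4.3 Proof of Theorem 3

Let $f_i$ for all $i = 0, \dots, M$ be the functions considered in Lemma 1. Recall that the evaluation oracle
is defined to be $E_f(x) := f(x) + w$, where $w$ is a random variable (independent of all other random
variables under consideration) with $\mathbb{E}[w] = 0$ and $\mathbb{E}[w^2] = \sigma^2 > 0$. Let $\{x_k\}_{k=1}^{T}$ be a sequence
of points in $\mathcal{B} \subset \mathbb{R}^n$ and let $\{E_f(x_k)\}_{k=1}^{T}$ denote the corresponding sequence of noisy evaluations
of $f \in \mathcal{F}$. For $\ell = 1, \dots, T$ let $P_{i,\ell}$ denote the joint probability distribution of $\{x_k, E_{f_i}(x_k)\}_{k=1}^{\ell}$,
let $Q_{i,\ell}$ denote the conditional distribution of $E_{f_i}(x_\ell)$ given $x_\ell$, and let $S_\ell$ denote the conditional
distribution of $x_\ell$ given $\{x_k, E_{f_i}(x_k)\}_{k=1}^{\ell-1}$. $S_\ell$ is a function of the underlying optimization algorithm
and does not depend on $i$. We can now bound the KL divergence between any two hypotheses as in
Section 4.1:

$$\mathrm{KL}(P_{i,T} \| P_{j,T}) \leq T \sup_{x_1 \in \mathcal{B}} \mathbb{E}_{P_{i,1}}\left[\mathbb{E}_{P_{i,1}}\left[\log\frac{Q_{i,1}}{Q_{j,1}} \,\Big|\, x_1\right]\right].$$

To compute a bound, let us assume that $w$ is Gaussian distributed. Then

$$\mathrm{KL}(P_{i,T} \| P_{j,T}) \leq T \sup_{z \in \mathcal{B}} \mathrm{KL}\left(\mathcal{N}(f_i(z), \sigma^2) \,\|\, \mathcal{N}(f_j(z), \sigma^2)\right)
= \frac{T}{2\sigma^2} \sup_{z \in \mathcal{B}} |f_i(z) - f_j(z)|^2 \leq \frac{T}{2\sigma^2}\left(2\tau n \epsilon^2\right)^2$$

by the third claim of Lemma 1. We then repeat the same procedure as in Section 4.1, setting
$a = 1/16$ and solving $\frac{T}{2\sigma^2}(2\tau n \epsilon^2)^2 = a\,\frac{n}{8}\log(2)$ for $\epsilon$, to attain

$$\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\left[f(x_{\hat{f}}) - f(x_f)\right] \geq \frac{1}{7}\left(\frac{1}{32}\right)\left(\frac{n\sigma^2\log(2)}{64\,T}\right)^{1/2}.$$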



5 Upper bounds 



The algorithm that achieves the upper bound using a pairwise comparison oracle is a combination
of standard techniques and methods from the convex optimization and statistical learning literature.
The algorithm is explained in full detail in Appendix B and is summarized as follows. At each
iteration the algorithm picks a coordinate uniformly at random from the $n$ possible dimensions
and then performs an approximate line search. By exploiting the fact that the function is strongly
convex with Lipschitz gradients, one guarantees using standard arguments that the approximate line
search makes a sufficient decrease in the objective function value in expectation [23, Ch.9.3]. If
the pairwise comparison oracle made no errors then the approximate line search is accomplished
by a binary-search-like scheme, essentially a golden section line-search algorithm [24]. However,
when responses from the oracle are only probably correct we make the line-search robust to errors
by repeating the same query until we can be confident about the true, uncorrupted direction of the
pairwise comparison using a standard procedure from the active learning literature [25] (a similar
technique was also implemented for the bandit setting of derivative-free optimization [8]). Because
the analysis of each component is either known or elementary, we only sketch the proof here and
leave the details to the supplementary materials.



5.1 Coordinate descent 



Given a candidate solution $x_k$ after $k \geq 0$ iterations, the algorithm defines a search direction $d_k = e_i$
where $i$ is chosen uniformly at random from the possible $n$ dimensions and $e_i$ is a vector of all zeros
except for a one in the $i$th coordinate. We note that while we only analyze the case where the search
direction $d_k$ is a coordinate direction, an analysis with the same result can be obtained with $d_k$
chosen uniformly from the unit sphere. Given $d_k$, a line search is then performed to find an $\alpha_k \in \mathbb{R}$
such that $f(x_{k+1}) - f(x_k)$ is sufficiently small where $x_{k+1} = x_k + \alpha_k d_k$. In fact, as we will see in
the next section, for some input parameter $\eta > 0$, the line search is guaranteed to return an $\alpha_k$ such
that $|\alpha_k - \alpha^*| \leq \eta$ where $\alpha^* = \arg\min_{\alpha \in \mathbb{R}} f(x_k + d_k\alpha)$. Using the fact that the gradients of $f$ are
Lipschitz ($L$) we have

$$f(x_k + \alpha_k d_k) - f(x_k + \alpha^* d_k) \leq \frac{L}{2}\|(\alpha_k - \alpha^*)d_k\|^2 = \frac{L}{2}|\alpha_k - \alpha^*|^2 \leq \frac{L}{2}\eta^2.$$

If we define $\hat{\alpha}_k = -\frac{\langle \nabla f(x_k), d_k \rangle}{L}$ then we have

$$\begin{aligned}
f(x_k + \alpha_k d_k) - f(x_k) &\leq f(x_k + \alpha^* d_k) - f(x_k) + \frac{L}{2}\eta^2 \\
&\leq f(x_k + \hat{\alpha}_k d_k) - f(x_k) + \frac{L}{2}\eta^2 \leq -\frac{\langle \nabla f(x_k), d_k \rangle^2}{2L} + \frac{L}{2}\eta^2
\end{aligned}$$

where the last line follows from applying the fact that the gradients are Lipschitz ($L$). Arranging the
bound and taking the expectation with respect to $d_k$ we get

$$\mathbb{E}\left[f(x_{k+1}) - f(x^*)\right] - \frac{L}{2}\eta^2 \leq \mathbb{E}\left[f(x_k) - f(x^*)\right] - \frac{\mathbb{E}\left[\|\nabla f(x_k)\|^2\right]}{2nL} \leq \mathbb{E}\left[f(x_k) - f(x^*)\right]\left(1 - \frac{\tau}{4nL}\right)$$

where the second inequality follows from the fact that $f$ is strongly convex ($\tau$). If we define
$\rho_k := \mathbb{E}\left[f(x_k) - f(x^*)\right]$ then we equivalently have

$$\rho_{k+1} - \frac{2nL^2\eta^2}{\tau} \leq \left(1 - \frac{\tau}{4nL}\right)\left(\rho_k - \frac{2nL^2\eta^2}{\tau}\right) \leq \left(1 - \frac{\tau}{4nL}\right)^{k}\left(\rho_0 - \frac{2nL^2\eta^2}{\tau}\right)$$

which leads to the following result.

Theorem 5. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$. For any $\eta > 0$ assume the line search returns an $\alpha_k$ that
is within $\eta$ of the optimal after at most $T_\ell(\eta)$ queries from the pairwise comparison oracle. If $x_K$ is
an estimate of $x^* = \arg\min_x f(x)$ after requesting no more than $K$ pairwise comparisons, then

$$\sup_f \mathbb{E}\left[f(x_K) - f(x^*)\right] \leq \frac{4nL^2\eta^2}{\tau} \quad \text{whenever} \quad K \geq \frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\eta^2\, 2nL^2/\tau}\right) T_\ell(\eta)$$

where the expectation is with respect to the random choice of $d_k$ at each iteration.

This implies that if we wish $\sup_f \mathbb{E}\left[f(x_K) - f(x^*)\right] \leq \epsilon$ it suffices to take $\eta = \sqrt{\frac{\epsilon\tau}{4nL^2}}$ so that at
most $\frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\epsilon/2}\right) T_\ell\left(\sqrt{\frac{\epsilon\tau}{4nL^2}}\right)$ pairwise comparisons are requested.
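
The error recursion can be traced numerically. The sketch below iterates the worst case of the one-step bound, read here as $\rho_{k+1} \leq (1 - \frac{\tau}{4nL})\rho_k + \frac{L}{2}\eta^2$, and checks that after $\frac{4nL}{\tau}\log\left(\rho_0 / (2nL^2\eta^2/\tau)\right)$ iterations the error is at most $4nL^2\eta^2/\tau$. It counts iterations only (each iteration costs $T_\ell(\eta)$ comparisons), and the function names and parameter values are illustrative assumptions.

```python
import math


def error_recursion(rho0, n, L, tau, eta, iterations):
    """Iterate rho <- (1 - tau/(4nL)) * rho + L * eta^2 / 2, the worst case of the bound."""
    rho = rho0
    for _ in range(iterations):
        rho = (1.0 - tau / (4.0 * n * L)) * rho + 0.5 * L * eta ** 2
    return rho


def error_floor(n, L, tau, eta):
    """Fixed point 2 n L^2 eta^2 / tau of the recursion."""
    return 2.0 * n * L ** 2 * eta ** 2 / tau


def iterations_needed(rho0, n, L, tau, eta):
    """Iterations after which the recursion is within a factor 2 of its fixed point."""
    floor = error_floor(n, L, tau, eta)
    return max(0, math.ceil(4.0 * n * L / tau * math.log(rho0 / floor)))
```

The fixed point scales with $n\eta^2$, which is why the line-search accuracy $\eta$ has to shrink with the target error and the dimension.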

5.2 Line search 

This section is concerned with minimizing a function $f(x_k + \alpha_k d_k)$ over some $\alpha_k \in \mathbb{R}$. In particular,
we wish to find an $\alpha_k \in \mathbb{R}$ such that $|\alpha_k - \alpha^*| \leq \eta$ where $\alpha^* = \arg\min_{\alpha \in \mathbb{R}} f(x_k + d_k\alpha)$. First assume
that the function comparison oracle makes no errors. The line search operates by maintaining a pair
of boundary points $\alpha^+, \alpha^-$ such that if at some iterate we have $\alpha^* \in [\alpha^-, \alpha^+]$ then at the next iterate,
we are guaranteed that $\alpha^*$ is still contained inside the boundary points but $|\alpha^+ - \alpha^-| \leftarrow \frac{1}{2}|\alpha^+ - \alpha^-|$.
An initial set of boundary points $\alpha^+ > 0$ and $\alpha^- < 0$ are found using simple binary search. Thus,
regardless of how far away or close $\alpha^*$ is, we converge to it exponentially fast. Exploiting the fact
that $f$ is strongly convex ($\tau$) with Lipschitz ($L$) gradients we can bound how far away or close $\alpha^*$
is from our initial iterate.

Theorem 6. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$ and let $C_f$ be a function comparison oracle that makes
no errors. Let $x \in \mathbb{R}^n$ be an initial position and let $d \in \mathbb{R}^n$ be a search direction with $\|d\| = 1$. If
$\alpha_K$ is an estimate of $\alpha^* = \arg\min_\alpha f(x + d\alpha)$ that is output from the line search after requesting
no more than $K$ pairwise comparisons, then for any $\eta > 0$

\a K ~ a*\ < 7) whenever K > 2 log 2 ( w v '—^f »\ . 
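
A comparison-only line search for the error-free oracle can be sketched as follows. This is a ternary-search variant that shrinks the bracket by a factor of 2/3 per comparison, chosen for simplicity; it is not the golden-section scheme referred to in the text, and the names `make_compare` and `line_search` are illustrative assumptions.

```python
def make_compare(f, x, d):
    """Error-free comparison oracle along the line x + alpha * d.
    compare(a, b) returns +1 if f(x + b*d) >= f(x + a*d) and -1 otherwise."""
    def phi(alpha):
        return f([xi + alpha * di for xi, di in zip(x, d)])

    def compare(a, b):
        return 1 if phi(b) - phi(a) >= 0 else -1

    return compare


def line_search(compare, eta, max_doublings=60):
    """Return (alpha, queries) with |alpha - alpha*| <= eta for a strictly convex line
    function, using only comparisons."""
    queries = 0
    lo, hi = -1.0, 1.0
    # Bracketing: phi(hi) >= phi(hi/2) implies the minimizer is at most hi.
    for _ in range(max_doublings):
        queries += 1
        if compare(hi / 2.0, hi) > 0:
            break
        hi *= 2.0
    # Symmetric argument on the negative side: phi(lo) >= phi(lo/2) implies alpha* >= lo.
    for _ in range(max_doublings):
        queries += 1
        if compare(lo / 2.0, lo) > 0:
            break
        lo *= 2.0
    # Shrinking: one comparison of two interior points discards a third of the bracket.
    while hi - lo > 2.0 * eta:
        third = (hi - lo) / 3.0
        m1, m2 = lo + third, hi - third
        queries += 1
        if compare(m1, m2) > 0:  # phi(m2) >= phi(m1): minimizer is at most m2
            hi = m2
        else:                    # phi(m2) < phi(m1): minimizer is at least m1
            lo = m1
    return 0.5 * (lo + hi), queries
```

Because both the bracketing and the shrinking phases are geometric, the number of comparisons grows only logarithmically in the initial distance to $\alpha^*$ and in $1/\eta$.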

5.3 Making the line search robust to errors 

Now assume that the responses from the pairwise comparison oracle are only probably correct in
accordance with the model introduced above. Essentially, the robust procedure runs the line search
as if the oracle made no errors except that each time a comparison is needed, the oracle is repeatedly
queried until we can be confident about the true direction of the comparison. This strategy applied
to active learning is well known because of its simplicity and its ability to adapt to unknown noise
conditions [25]. However, we mention that when used in this way, this sampling procedure is known
to be sub-optimal, so in practice one may want to implement a more efficient approach like that of
[21]. Nevertheless, we have the following lemma.

Lemma 2. [25] For any $x, y \in \mathcal{B}$ with $P\left(C_f(x, y) = \mathrm{sign}\{f(y) - f(x)\}\right) = p$, with probability
at least $1 - \delta$ the coin-tossing algorithm of [25] correctly identifies the sign of $\mathbb{E}\left[C_f(x, y)\right]$ and
requests no more than $\frac{\log(2/\delta)}{4|1/2 - p|^2}\log_2\left(\frac{\log(2/\delta)}{4|1/2 - p|^2}\right)$ pairwise comparisons.
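
The repeated-query idea can be sketched with a simple stopping rule. The code below keeps querying until the running sum of $\pm 1$ responses leaves a Hoeffding-type confidence band that holds simultaneously over all times (via a union bound with $\delta_t = \delta / (t(t+1))$). This is a simplified stand-in and is not claimed to be the coin-tossing algorithm of [25]; `robust_sign`, `noisy_query`, and the query cap are illustrative assumptions.

```python
import math
import random


def robust_sign(query, delta, max_queries=10 ** 6):
    """Repeat a +/-1 valued query until the sign of its mean is identified.
    Returns (sign, number_of_queries). Errs with probability at most delta,
    provided the band is exited before max_queries."""
    total = 0
    for t in range(1, max_queries + 1):
        total += query()
        # Anytime band: P(exists t with |S_t - t*m| >= radius_t) <= delta.
        radius = math.sqrt(2.0 * t * math.log(2.0 * t * (t + 1) / delta))
        if abs(total) > radius:
            return (1 if total > 0 else -1), t
    return (1 if total >= 0 else -1), max_queries


def noisy_query(true_sign, p_correct, rng):
    """Build a query that returns true_sign with probability p_correct."""
    def query():
        return true_sign if rng.random() < p_correct else -true_sign
    return query
```

The number of queries adapts to the unknown gap $|1/2 - p|$: easy comparisons stop early, while comparisons with $p$ close to $1/2$ are the expensive ones, which is why the line search must keep $|1/2 - p|$ bounded away from zero.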

It would be convenient if we could simply apply the result of Lemma 2 to our line search procedure.
Unfortunately, if we do this there is no guarantee that $|f(y) - f(x)|$ is bounded below so for the
case when $\kappa > 1$, it would be impossible to lower bound $|1/2 - p|$ in the lemma. To account
for this, we will sample at multiple locations per iteration as opposed to just two in the noiseless
algorithm to ensure that we can always lower bound $|1/2 - p|$. Intuitively, strong convexity ensures
that $f$ cannot be arbitrarily flat so for any three equally spaced points $x, y, z$ on the line $d_k$, if
$f(x)$ is equal to $f(y)$, then it follows that the absolute difference between $f(x)$ and $f(z)$ must be
bounded away from zero. Applying this idea and union bounding over the total number of times
one must call the coin-tossing algorithm, one finds that with probability at least $1 - \delta$, the total
number of calls to the pairwise comparison oracle over the course of the whole algorithm does
not exceed $\tilde{O}\left(\frac{nL}{\tau}\left(\frac{n}{\epsilon}\right)^{2(\kappa-1)}\log^2\left(\frac{f(x_0) - f(x^*)}{\epsilon}\right)\log(n/\delta)\right)$. By finding a $T > 0$ that satisfies this
bound for any $\epsilon$ we see that this is equivalent to a rate of $O\left(n\log(n/\delta)\left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}}\right)$ for $\kappa > 1$ and
$O\left(\exp\left\{-c\sqrt{\frac{T}{n\log(n/\delta)}}\right\}\right)$ for $\kappa = 1$, ignoring polylog factors.

6 Conclusion 

This paper presented lower bounds on the performance of derivative-free optimization for (i) an
oracle that provides noisy function evaluations and (ii) an oracle that provides probably correct
boolean comparisons between function evaluations. Our results were proven for the class of strongly
convex functions but because this class is a subset of all, possibly non-convex functions, our lower
bounds hold for much larger classes as well. Under both oracle models we showed that the expected
error decays like $\Omega\left((n/T)^{1/2}\right)$. Furthermore, for the class of strongly convex functions with Lipschitz
gradients, we proposed an algorithm that achieves a rate of $\tilde{O}\left(n(n/T)^{1/2}\right)$ for both oracle models,
which shows that the lower bounds are tight with respect to the dependence on the number of
iterations $T$ and no more than a factor of $n$ off in terms of the dimension.

A number of open questions still remain. In particular, one would like to resolve the gap between
the lower and upper bounds with respect to the dependence on the dimension. Due to real world
constraints, it is also desirable to extend the pairwise comparison algorithm to operate under the
conditions of constrained optimization where $\mathcal{B}$ is a convex, proper subset of $\mathbb{R}^n$. Also, while the
analysis of our algorithm relies heavily on the assumption that the function is strongly convex with
Lipschitz gradients, it is unclear whether these assumptions are necessary to achieve the same rates
of convergence. Developing a practical algorithm that achieves our lower bounds and does not suffer
from these limiting assumptions would be a significant contribution.

A Bounds on $(\kappa, \mu, \delta_0)$ for some distributions

In this section we relate the function evaluation oracle to the function comparison oracle for some
common distributions. That is, if $E_f(x) = f(x) + w$ for some random variable $w$, we lower
bound the probability $\eta(y, x) := P\left(\mathrm{sign}\{E_f(y) - E_f(x)\} = \mathrm{sign}\{f(y) - f(x)\}\right)$ in terms of the
parameterization of (1).

Lemma 3. Let $w$ be a Gaussian random variable with mean zero and variance $\sigma^2$. Then
$\eta(y, x) \geq \frac{1}{2} + \min\left\{\frac{1}{\sqrt{2\pi e}}, \frac{1}{\sqrt{4\pi\sigma^2 e}}|f(y) - f(x)|\right\}$.

Proof. Notice that $\eta(y, x) = P\left(Z + |f(y) - f(x)|/\sqrt{2\sigma^2} \geq 0\right)$ where $Z$ is a standard normal. The
result follows by lower bounding the density of $Z$ by $\frac{1}{\sqrt{2\pi e}}\mathbf{1}\{|Z| \leq 1\}$ and integrating, where $\mathbf{1}\{\cdot\}$
is equal to one when its arguments are true and zero otherwise. □

We say $w$ is a 2-sided gamma distributed random variable if its density is given by
$\frac{\beta^\alpha}{2\Gamma(\alpha)}|x|^{\alpha-1}e^{-\beta|x|}$ for $x \in [-\infty, \infty]$ and $\alpha, \beta > 0$. Note that this distribution is unimodal only
for $\alpha \in (0, 1]$ and is equal to a Laplace distribution for $\alpha = 1$. This distribution has variance
$\sigma^2 = \alpha(\alpha+1)/\beta^2$.

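Lemma 4. Let $w$ be a 2-sided gamma distributed random variable with parameters $\alpha \in (0, 1]$ and
$\beta > 0$. Then
$$\eta(y,x) \ge \frac{1}{2} + \min\left\{ \frac{1}{4\alpha^2\Gamma(\alpha)^2}\left(\frac{\alpha}{e}\right)^{2\alpha},\ \frac{1}{4\alpha^2\Gamma(\alpha)^2}\left(\frac{\beta|f(y)-f(x)|}{2e}\right)^{2\alpha} \right\}.$$

Proof. Let $E_f(y) = f(y) + w$ and $E_f(x) = f(x) + w'$ where $w$ and $w'$ are i.i.d. 2-sided gamma
distributed random variables, and write $t = |f(y) - f(x)|$. If we lower bound $e^{-\beta|x|}$ with $e^{-\alpha}\mathbf{1}\{|x| \le \alpha/\beta\}$ and integrate we
find that
$$\mathbb{P}(-t/2 \le w \le 0) \ge \min\left\{ \frac{1}{2\alpha\Gamma(\alpha)}\left(\frac{\alpha}{e}\right)^{\alpha},\ \frac{1}{2\alpha\Gamma(\alpha)}\left(\frac{\beta t}{2e}\right)^{\alpha} \right\}.$$
And by the symmetry and independence of $w$ and $w'$ we have
$\mathbb{P}(-t \le w - w') \ge \frac{1}{2} + \mathbb{P}(-t/2 \le w \le 0)\,\mathbb{P}(-t/2 \le w' \le 0)$.

□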
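
To illustrate how loose the 2-sided gamma bound can be, the sketch below specialises to $\alpha = 1$ (Laplace noise), where the difference of two i.i.d. Laplace variables with rate $\beta$ has the closed-form tail $\mathbb{P}(w - w' > t) = \frac{1}{4}(2 + \beta t)e^{-\beta t}$ for $t \ge 0$. That closed form is a standard fact we bring in from outside the text, and the lower bound coded here is what the integration recipe in the proof gives at $\alpha = 1$; both function names are ours.

```python
import math

def laplace_agreement_prob(t, beta):
    """Exact P(w - w' > -t) for i.i.d. Laplace(rate beta) noise and t >= 0."""
    s = beta * t
    return 1.0 - 0.25 * (2.0 + s) * math.exp(-s)

def lemma4_bound_laplace(t, beta):
    """1/2 + P(-t/2 <= w <= 0)^2, using the proof's lower bound at alpha = 1.

    At alpha = 1 the one-sided bound is (1/2) * min{1/e, beta * t / (2e)}.
    """
    one_sided = 0.5 * min(1.0 / math.e, beta * t / (2.0 * math.e))
    return 0.5 + one_sided ** 2

# Illustrative values only: the gap between exact and bound shows the looseness.
for t in (0.01, 0.1, 1.0, 10.0):
    for beta in (0.5, 1.0, 3.0):
        print(t, beta, laplace_agreement_prob(t, beta), lemma4_bound_laplace(t, beta))
```

For small gaps the exact probability exceeds one half by roughly $\beta t / 4$, whereas the bound only certifies an excess of order $(\beta t)^2$.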

While the bound in the lemma immediately above can be shown to be loose, these two lemmas are
sufficient to show that the entire range of $\kappa \in (1, 2]$ is possible.



B Upper Bounds - Extended 



The algorithm that achieves the upper bound using a pairwise comparison oracle is a combination of
a few standard techniques and methods pulled from the convex optimization and statistical learning
literature. The algorithm can be summarized as follows. At each iteration the algorithm picks a
coordinate uniformly at random from the $n$ possible dimensions and then performs an approximate
line search. By exploiting the fact that the function is strongly convex with Lipschitz gradients, one
guarantees using standard arguments that the approximate line search makes a sufficient decrease in
the objective function value in expectation [23, Ch.9.3]. If the pairwise comparison oracle made no
errors then the approximate line search is accomplished by a binary-search-like scheme that is known
in the literature as the golden section line-search algorithm [24]. However, when responses from the
oracle are only probably correct we make the line search robust to errors by repeating the same
query until we can be confident about the true, uncorrupted direction of the pairwise comparison
using a standard procedure from the active learning literature [25].



B.1 Coordinate descent algorithm

Theorem 7. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$. For any $\eta > 0$ assume the line search in the algorithm of
Figure 1 requires at most $T_\ell(\eta)$ queries from the pairwise comparison oracle. If $x_K$ is an estimate
of $x^* = \arg\min_x f(x)$ after requesting no more than $K$ pairwise comparisons, then

$$\sup_{x_0} \mathbb{E}[f(x_K) - f(x^*)] \le \frac{4nL^2\eta^2}{\tau} \quad \text{whenever} \quad K \ge \frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\eta^2\, 2nL^2/\tau}\right) T_\ell(\eta)$$

where the expectation is with respect to the random choice of $d_k$ at each iteration.

n-dimensional Pairwise comparison algorithm

Input: $x_0 \in \mathbb{R}^n$, $\eta > 0$
For $k = 0, 1, 2, \ldots$
    Choose $d_k = e_i$ for $i \in \{1, \ldots, n\}$ chosen uniformly at random
    Obtain $\alpha_k$ from a line search such that
        $|\alpha_k - \alpha^*| \le \eta$ where $\alpha^* = \arg\min_\alpha f(x_k + \alpha d_k)$
    $x_{k+1} = x_k + \alpha_k d_k$
end

Figure 1: Algorithm to minimize a convex function in $n$ dimensions. Here $e_i$ is understood to be a
vector of all zeros with a one in the $i$th position.
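
The outer loop of Figure 1 can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the line search is passed in as a callable, and the test problem (a diagonal quadratic with an adversarial line search that always errs by exactly $\eta$) is a hypothetical example chosen so that $\tau$, $L$ and the exact coordinate minimizer are known.

```python
import random

def coordinate_descent(x0, line_search, num_iters, rng):
    """Random coordinate descent driven by an approximate line search.

    line_search(x, i) must return a step alpha along coordinate i with
    |alpha - alpha*| <= eta, where alpha* minimizes f(x + alpha * e_i).
    """
    x = list(x0)
    n = len(x)
    for _ in range(num_iters):
        i = rng.randrange(n)        # d_k = e_i with i uniform over the n coordinates
        x[i] += line_search(x, i)   # x_{k+1} = x_k + alpha_k * d_k
    return x

# Hypothetical test problem: f(x) = 0.5 * sum_i a_i x_i^2, so tau = min(a), L = max(a).
A = [1.0, 2.0, 4.0]
ETA = 1e-3

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))

def worst_case_line_search(x, i):
    # The exact minimizer along e_i is alpha* = -x[i]; return it with the largest allowed error.
    return -x[i] + ETA

x_final = coordinate_descent([5.0, -3.0, 2.0], worst_case_line_search, 200, random.Random(0))
print(f(x_final))
```

In this example the final objective value can be compared against the residual $4nL^2\eta^2/\tau$ from Theorem 7, which it stays well below because the quadratic is separable.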



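Proof. First note that $\|d_k\| = 1$ for all $k$ with probability 1. Because the gradients of $f$ are Lipschitz
($L$) we have from Taylor's theorem

$$f(x_{k+1}) \le f(x_k) + \langle \nabla f(x_k), \alpha_k d_k \rangle + \frac{\alpha_k^2 L}{2}.$$

Note that the right-hand side is convex in $\alpha_k$ and is minimized by

$$\hat{\alpha}_k = -\frac{\langle \nabla f(x_k), d_k \rangle}{L}.$$

However, recalling how $\alpha_k$ is chosen, if $\alpha^* = \arg\min_\alpha f(x_k + \alpha d_k)$ then we have

$$f(x_k + \alpha_k d_k) - f(x_k + \alpha^* d_k) \le \frac{L}{2}\|(\alpha_k - \alpha^*) d_k\|^2 = \frac{L}{2}|\alpha_k - \alpha^*|^2 \le \frac{L}{2}\eta^2.$$

This implies

$$\begin{aligned}
f(x_k + \alpha_k d_k) - f(x_k) &\le f(x_k + \alpha^* d_k) - f(x_k) + \frac{L}{2}\eta^2 \\
&\le f(x_k + \hat{\alpha}_k d_k) - f(x_k) + \frac{L}{2}\eta^2 \\
&\le -\frac{\langle \nabla f(x_k), d_k \rangle^2}{2L} + \frac{L}{2}\eta^2.
\end{aligned}$$

Taking the expectation with respect to $d_k$, we have

$$\begin{aligned}
\mathbb{E}[f(x_{k+1})] &\le \mathbb{E}[f(x_k)] - \mathbb{E}\left[\frac{\langle \nabla f(x_k), d_k \rangle^2}{2L}\right] + \frac{L}{2}\eta^2 \\
&= \mathbb{E}[f(x_k)] - \mathbb{E}\left[\mathbb{E}\left[\frac{\langle \nabla f(x_k), d_k \rangle^2}{2L} \,\Big|\, d_0, \ldots, d_{k-1}\right]\right] + \frac{L}{2}\eta^2 \\
&= \mathbb{E}[f(x_k)] - \mathbb{E}\left[\frac{\|\nabla f(x_k)\|^2}{2nL}\right] + \frac{L}{2}\eta^2
\end{aligned}$$

where we applied the law of iterated expectation. Let $x^* = \arg\min_x f(x)$ and note that $x^*$ is a
unique minimizer by strong convexity ($\tau$). Using the previous calculation we have

$$\mathbb{E}[f(x_{k+1}) - f(x^*)] - \frac{L}{2}\eta^2 \le \mathbb{E}[f(x_k) - f(x^*)] - \frac{\mathbb{E}\left[\|\nabla f(x_k)\|^2\right]}{2nL} \le \mathbb{E}[f(x_k) - f(x^*)]\left(1 - \frac{\tau}{4nL}\right)$$

where the second inequality follows from

$$\begin{aligned}
(f(x_k) - f(x^*))^2 &\le (\langle \nabla f(x_k), x_k - x^* \rangle)^2 \\
&\le \|\nabla f(x_k)\|^2 \|x_k - x^*\|^2 \le \|\nabla f(x_k)\|^2\, \frac{2}{\tau}(f(x_k) - f(x^*)).
\end{aligned}$$

If we define $\rho_k := \mathbb{E}[f(x_k) - f(x^*)]$ then we equivalently have

$$\rho_{k+1} - \frac{2nL^2\eta^2}{\tau} \le \left(1 - \frac{\tau}{4nL}\right)\left(\rho_k - \frac{2nL^2\eta^2}{\tau}\right) \le \left(1 - \frac{\tau}{4nL}\right)^{k+1}\left(\rho_0 - \frac{2nL^2\eta^2}{\tau}\right).$$

Since $(1 - \frac{\tau}{4nL})^k \le \exp(-\frac{k\tau}{4nL})$, we have $\rho_k \le \frac{4nL^2\eta^2}{\tau}$ once $k \ge \frac{4nL}{\tau}\log\left(\frac{\rho_0}{2nL^2\eta^2/\tau}\right)$, and each
iteration requests at most $T_\ell(\eta)$ pairwise comparisons, which completes the proof.

□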



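This implies that if we wish $\sup_{x_0} \mathbb{E}[f(x_K) - f(x^*)] \le \epsilon$ it suffices to take $\eta = \sqrt{\frac{\epsilon\tau}{4nL^2}}$ so that at
most

$$\frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\epsilon/2}\right) T_\ell\left(\sqrt{\frac{\epsilon\tau}{4nL^2}}\right)$$

pairwise comparisons are requested.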
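
The choice of $\eta$ and the resulting number of outer iterations can be checked numerically. The sketch below assumes the one-step contraction $\rho_{k+1} \le (1 - \frac{\tau}{4nL})\rho_k + \frac{L}{2}\eta^2$ from the proof, iterates it with equality as a worst case, and confirms that the expected gap falls below $\epsilon$ after $\frac{4nL}{\tau}\log\frac{\rho_0}{\epsilon/2}$ steps. The helper names and the numerical values are ours and purely illustrative.

```python
import math

def tolerance_for_accuracy(eps, n, L, tau):
    """Line-search tolerance eta = sqrt(eps * tau / (4 n L^2)), so 4 n L^2 eta^2 / tau = eps."""
    return math.sqrt(eps * tau / (4.0 * n * L ** 2))

def outer_iterations(eps, n, L, tau, gap0):
    """Coordinate steps needed: (4 n L / tau) * log(gap0 / (eps / 2)), rounded up.

    gap0 stands for f(x_0) - f(x*) and is assumed to exceed eps / 2.
    """
    return math.ceil(4.0 * n * L / tau * math.log(gap0 / (eps / 2.0)))

def worst_case_gap(eps, n, L, tau, gap0):
    """Iterate rho_{k+1} = (1 - tau/(4nL)) * rho_k + L * eta^2 / 2 with equality."""
    eta = tolerance_for_accuracy(eps, n, L, tau)
    rho = gap0
    for _ in range(outer_iterations(eps, n, L, tau, gap0)):
        rho = (1.0 - tau / (4.0 * n * L)) * rho + 0.5 * L * eta ** 2
    return rho

# Illustrative parameters: n = 3, L = 4, tau = 1, initial gap 100, target accuracy 0.01.
print(outer_iterations(1e-2, 3, 4.0, 1.0, 100.0), worst_case_gap(1e-2, 3, 4.0, 1.0, 100.0))
```

Multiplying the iteration count by a per-line-search budget $T_\ell(\eta)$ gives the total number of pairwise comparisons.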
B.2 Line search



This section is concerned with minimizing a function $f(x_k + \alpha d_k)$ over some $\alpha \in \mathbb{R}$. Because we
are minimizing over a single variable, $\alpha$, we will restart the indexing at 0 such that the line search
algorithm produces a sequence $\alpha_0, \alpha_1, \ldots, \alpha_{K'}$. This indexing should not be confused with the
indexing of the iterates $x_1, x_2, \ldots, x_K$. We will first present an algorithm that assumes the pairwise
comparison oracle makes no errors and then extend the algorithm to account for the noise model
introduced in Section 2.

Consider the algorithm of Figure 2. At each iteration, one is guaranteed to eliminate at least 1/2 of
the search space, so that at least 1/4 of the search space is discarded for every
pairwise comparison that is requested. However, with a slight modification to the algorithm, one
can guarantee a greater fraction of removal (see the golden section line-search algorithm). We use
this sub-optimal version for simplicity because it will help provide intuition for how the robust
version of the algorithm works.



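One Dimensional Pairwise comparison algorithm

Input: $x \in \mathbb{R}^n$, $d \in \mathbb{R}^n$, $\eta > 0$
Initialize: $\alpha_0 = 0$, $\alpha_0^+ = \alpha_0 + 1$, $\alpha_0^- = \alpha_0 - 1$, $k = 0$
If $C_f(x, x + d\alpha_0^+) > 0$ and $C_f(x, x + d\alpha_0^-) < 0$
    $\alpha_0^+ = 0$
end
If $C_f(x, x + d\alpha_0^-) > 0$ and $C_f(x, x + d\alpha_0^+) < 0$
    $\alpha_0^- = 0$
end
While $C_f(x, x + d\alpha_k^+) < 0$
    $\alpha_{k+1}^+ = 2\alpha_k^+$, $k = k + 1$
end
While $C_f(x, x + d\alpha_k^-) < 0$
    $\alpha_{k+1}^- = 2\alpha_k^-$, $k = k + 1$
end
$\alpha_k = \frac{1}{2}(\alpha_k^- + \alpha_k^+)$
While $|\alpha_k^+ - \alpha_k^-| \ge \eta/2$
    if $C_f(x + d\alpha_k, x + d\,\frac{1}{2}(\alpha_k + \alpha_k^+)) < 0$
        $\alpha_{k+1} = \frac{1}{2}(\alpha_k + \alpha_k^+)$, $\alpha_{k+1}^+ = \alpha_k^+$, $\alpha_{k+1}^- = \alpha_k$
    else if $C_f(x + d\alpha_k, x + d\,\frac{1}{2}(\alpha_k + \alpha_k^-)) < 0$
        $\alpha_{k+1} = \frac{1}{2}(\alpha_k + \alpha_k^-)$, $\alpha_{k+1}^+ = \alpha_k$, $\alpha_{k+1}^- = \alpha_k^-$
    else
        $\alpha_{k+1} = \alpha_k$, $\alpha_{k+1}^+ = \frac{1}{2}(\alpha_k + \alpha_k^+)$, $\alpha_{k+1}^- = \frac{1}{2}(\alpha_k + \alpha_k^-)$
    end
end
Output: $\alpha_k$

Figure 2: Algorithm to minimize a convex function in one dimension.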
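
A direct transcription of the noiseless line search of Figure 2 is sketched below, working with the one-dimensional restriction $g(\alpha) = f(x + \alpha d)$. The `compare(a, b)` callable is our stand-in for $C_f(x + da, x + db) = \mathrm{sign}\{g(b) - g(a)\}$; the sketch reuses the two initial comparisons rather than re-querying them, and the quadratic test function is a hypothetical example. The bound it is compared against, $2\log_2\left(\frac{256L(f(x) - f(x + d\alpha^*))}{\tau^2\eta^2}\right)$, is our reading of Theorem 8 and its constants should be treated as an assumption.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def comparison_line_search(compare, eta):
    """Noiseless line search using only comparisons of g(alpha) = f(x + alpha * d).

    compare(a, b) returns sign(g(b) - g(a)); g is assumed strongly convex.
    """
    a_minus, a_plus = -1.0, 1.0
    c_plus = compare(0.0, a_plus)
    c_minus = compare(0.0, a_minus)
    if c_plus > 0 and c_minus < 0:      # g increases to the right: minimizer is in [-., 0]
        a_plus = 0.0
    if c_minus > 0 and c_plus < 0:      # g increases to the left: minimizer is in [0, .]
        a_minus = 0.0
    while compare(0.0, a_plus) < 0:     # double until the right end is no better than 0
        a_plus *= 2.0
    while compare(0.0, a_minus) < 0:    # same on the left
        a_minus *= 2.0
    a = 0.5 * (a_minus + a_plus)
    while a_plus - a_minus >= eta / 2.0:
        right = 0.5 * (a + a_plus)
        left = 0.5 * (a + a_minus)
        if compare(a, right) < 0:       # right quarter point is better: keep [a, a_plus]
            a_minus, a = a, right
        elif compare(a, left) < 0:      # left quarter point is better: keep [a_minus, a]
            a_plus, a = a, left
        else:                           # midpoint is best: keep the middle half
            a_minus, a_plus = left, right
    return a

# Hypothetical example: g(alpha) = (alpha - 3)^2, so tau = L = 2 and alpha* = 3.
counter = {"calls": 0}

def g(alpha):
    return (alpha - 3.0) ** 2

def compare(a, b):
    counter["calls"] += 1
    return sign(g(b) - g(a))

alpha_hat = comparison_line_search(compare, eta=0.01)
bound = 2.0 * math.log2(256 * 2.0 * (g(0.0) - g(3.0)) / (2.0 ** 2 * 0.01 ** 2))
print(alpha_hat, counter["calls"], bound)
```

Each pass through the final loop halves the bracket with at most two comparisons, which is the 1/4-per-comparison guarantee discussed above.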



Theorem 8. Let $f \in \mathcal{F}_{\tau,L,\mathcal{B}}$ with $\mathcal{B} = \mathbb{R}^n$ and let $C_f$ be a function comparison oracle that makes
no errors. Let $x \in \mathbb{R}^n$ be an initial position and let $d \in \mathbb{R}^n$ be a search direction with $\|d\| = 1$. If
$\alpha_K$ is an estimate of $\alpha^* = \arg\min_\alpha f(x + d\alpha)$ that is output from the algorithm of Figure 2 after
requesting no more than $K$ pairwise comparisons, then for any $\eta > 0$

$$|\alpha_K - \alpha^*| \le \eta \quad \text{whenever} \quad K \ge 2\log_2\left(\frac{256L\,(f(x) - f(x + d\alpha^*))}{\tau^2\eta^2}\right).$$



Proof. First note that if $\alpha_K$ is output from the algorithm, we have $\frac{1}{2}|\alpha_K - \alpha^*| \le |\alpha_K^+ - \alpha_K^-| \le \frac{1}{2}\eta$,
as desired.

We will handle the cases when $|\alpha^*|$ is greater than one and less than one separately. First assume that
$|\alpha^*| > 1$. Using the fact that $f$ is strongly convex ($\tau$), it is straightforward to show that immediately
after exiting the initial while loops, (i) at most $2 + \frac{1}{2}\log_2\left(\frac{8}{\tau}(f(x) - f(x + d\alpha^*))\right)$ pairwise comparisons
were requested, (ii) $\alpha^* \in [\alpha_k^-, \alpha_k^+]$, and (iii) $|\alpha_k^+ - \alpha_k^-| \le \left(\frac{8}{\tau}(f(x) - f(x + d\alpha^*))\right)^{1/2}$.
We also have that $\alpha^* \in [\alpha_{k+1}^-, \alpha_{k+1}^+]$ if $\alpha^* \in [\alpha_k^-, \alpha_k^+]$ for all $k$. Thus, it follows that

$$|\alpha_{k+l}^+ - \alpha_{k+l}^-| = 2^{-l}|\alpha_k^+ - \alpha_k^-| \le 2^{-l}\left(\frac{8}{\tau}(f(x) - f(x + d\alpha^*))\right)^{1/2}.$$

To make the right-hand side less than or equal to $\eta/2$, set

$$l = \log_2\left(\frac{\left(\frac{8}{\tau}(f(x) - f(x + d\alpha^*))\right)^{1/2}}{\eta/2}\right).$$

This brings the total number of pairwise comparison requests to no more than
$2\log_2\left(\frac{32\,(f(x) - f(x + d\alpha^*))}{\tau\eta}\right)$.

Now assume that $|\alpha^*| \le 1$. A straightforward calculation shows that the while loops will terminate
after requesting at most $2 + \frac{1}{2}\log_2\left(\frac{L}{\tau}\right)$ pairwise comparisons. And immediately after exiting the
while loops we have $|\alpha_k^+ - \alpha_k^-| \le 2$. It follows by the same arguments as above that if we want
$|\alpha_{k+l}^+ - \alpha_{k+l}^-| \le \eta/2$ it suffices to set $l = \log_2\left(\frac{4}{\eta}\right)$. This brings the total number of pairwise
comparison requests to no more than $2\log_2\left(\frac{8L}{\tau\eta}\right)$. For sufficiently small $\eta$ both cases are positive
and the result follows from adding the two. □



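This implies that if the function comparison oracle makes no errors and it is given an
iterate $x_k$ and direction $d_k$ then $T_\ell\left(\sqrt{\frac{\epsilon\tau}{4nL^2}}\right) \le 2\log_2\left(\frac{1024\,nL^3\,(f(x_k) - f(x_k + d_k\alpha^*))}{\tau^3\epsilon}\right)$,
which brings the total number of pairwise comparisons requested to at most

$$\frac{8nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\epsilon/2}\right)\log_2\left(\frac{1024\,nL^3\max_k\,(f(x_k) - f(x_k + d_k\alpha^*))}{\tau^3\epsilon}\right).$$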



B.3 Proof of Theorem 2

We now introduce a line search algorithm that is robust to a function comparison oracle that makes 
errors. Essentially, the algorithm consists of nothing more than repeatedly querying the same random 
pairwise comparison. This strategy applied to active learning is well known because of its simplicity 
and its ability to adapt to unknown noise conditions [25]. However, we mention that when used
in this way, this sampling procedure is known to be sub-optimal so in practice, one may want to 
implement a more efficient approach like that of lETI. Consider the subroutine of Figure 3.



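Repeated querying subroutine

Input: $x, y \in \mathbb{R}^n$, $\delta > 0$
Initialize: $S = \emptyset$, $l = -1$
do
    $l = l + 1$
    $\Delta_l = \sqrt{\frac{(l+1)\log(2/\delta)}{2^l}}$
    $S = S \cup \{2^l \text{ i.i.d. draws of } C_f(x, y)\}$
while $\left|\frac{1}{|S|}\sum_{e_i \in S} e_i\right| - \Delta_l < 0$
return $\mathrm{sign}\left\{\sum_{e_i \in S} e_i\right\}$.

Figure 3: Subroutine that estimates $\mathbb{E}[C_f(x, y)]$ by repeatedly querying the random variable.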
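
The repeated querying subroutine can be sketched as follows. The figure is only partly legible in our copy, so the stopping rule used here (compare the empirical mean of the $\pm 1$ responses against $\Delta_l = \sqrt{(l+1)\log(2/\delta)/2^l}$ after each batch of $2^l$ new draws) is an assumption on our part, as are the function names and the simulated oracle.

```python
import math
import random

def repeated_query(draw, delta):
    """Query a noisy +/-1 comparison until its sign can be trusted.

    draw() returns one sample of C_f(x, y) in {-1, +1}. Batch l adds 2^l fresh
    samples; we stop once |empirical mean| >= Delta_l (assumed threshold).
    Returns (estimated sign, number of samples used).
    """
    total, count, l = 0, 0, -1
    while True:
        l += 1
        threshold = math.sqrt((l + 1) * math.log(2.0 / delta) / 2 ** l)
        for _ in range(2 ** l):
            total += draw()
            count += 1
        if abs(total / count) - threshold >= 0:
            break
    return (1 if total > 0 else -1), count

def make_noisy_oracle(true_sign, p, rng):
    """Simulated oracle (illustrative): answers true_sign with probability p."""
    return lambda: true_sign if rng.random() < p else -true_sign

# An error-free oracle still needs several batches before the threshold drops below 1.
print(repeated_query(lambda: 1, delta=0.05))

# A noisy oracle that is right 80% of the time uses more samples.
print(repeated_query(make_noisy_oracle(-1, 0.8, random.Random(0)), delta=0.05))
```

The number of samples grows as the oracle's accuracy $p$ approaches 1/2, which is the dependence on $|1/2 - p|$ that the analysis of the robust line search has to control.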



Lemma 5. [25] For any $x, y$ with $\mathbb{P}(C_f(x, y) = \mathrm{sign}\{f(y) - f(x)\}) = p$, then with probability
at least $1 - \delta$ the algorithm of Figure 3 correctly identifies the sign of $\mathbb{E}[C_f(x, y)]$ and requests
no more than

$$\frac{\log(2/\delta)}{4|1/2 - p|^2}\log_2\left(\frac{\log(2/\delta)}{4|1/2 - p|^2}\right)$$

pairwise comparisons.

It would be convenient if we could simply apply the result of Lemma 2 to the algorithm of Figure 2.
Unfortunately, if we do this there is no guarantee that $|f(y) - f(x)|$ is bounded below so for the case
when $\kappa > 1$, it would be impossible to lower bound $|1/2 - p|$ in the lemma. To account for this, we
will sample at four points per iteration as opposed to just two in the noiseless algorithm to ensure
that we can always lower bound $|1/2 - p|$. We will see that the algorithm and analysis naturally
adapt to when $\kappa = 1$ or $\kappa > 1$.

Consider the following modification to the algorithm of Figure 2. We discuss the sampling process
that takes place in $[\alpha_k, \alpha_k^+]$ but it is understood that the same process is repeated symmetrically in
$[\alpha_k^-, \alpha_k]$. We begin with the first two while loops. Instead of repeatedly sampling $C_f(x, x + d\alpha_k^+)$
we will have two sampling procedures running in parallel that repeatedly compare $\alpha_k$ to $\alpha_k^+$ and $\alpha_k$
to $2\alpha_k^+$. As soon as the repeated sampling procedure terminates for one of them we terminate the
second sampling strategy and proceed with what the noiseless algorithm would do with $\alpha_k^+$ assigned
to be the sampling location that finished first. Once we're out of the initial while loops, instead of
comparing $\alpha_k$ to $\frac{1}{2}(\alpha_k + \alpha_k^+)$ repeatedly, we will repeatedly compare $\alpha_k$ to $\frac{1}{3}(\alpha_k + \alpha_k^+)$ and $\alpha_k$ to
$\frac{2}{3}(\alpha_k + \alpha_k^+)$. Again, we will treat the location that finishes its sampling first as $\frac{1}{2}(\alpha_k + \alpha_k^+)$ in the
noiseless algorithm.

If we perform this procedure every iteration, then at each iteration we are guaranteed to remove at
least 1/3 of the search space, as opposed to 1/2 in the noiseless case, so the number
of iterations of the robust algorithm is within a constant factor of the number of iterations of the
noiseless algorithm. However, unlike the noiseless case where at most two pairwise comparisons
were requested at each iteration, we must now apply Lemma 5 to determine the number of pairwise
comparisons that are requested per iteration.

Intuitively, the repeated sampling procedure requests the most pairwise comparisons when the
distance between the two function evaluations being compared is smallest. This corresponds to when
the distance between probe points is smallest, i.e. when $\eta/2 \le |\alpha_k - \alpha^*| \le \eta$. By
considering this worst case, we can bound the number of pairwise comparisons that are requested
at any iteration. By strong convexity ($\tau$) we find through a straightforward calculation that
max{|/(x + da fe ) - f(x + d%(a k + a+))\,\f{x + da k ) - f{x + d\(u k +a+))\} > ^r) 2 for 
all k. This implies |l/2 — p\ > ji (j^ 2 )'" so that on on any given call to the repeated querying 
subroutine, with probability at least 1 — 6 the subroutine requests no more than O ( ^~ f-pffi) ) 

pairwise comparisons. However, because we want the guarantee to hold with probability $1 - \delta$ over
all calls to the subroutine, not just one, we must union bound over 4 pairwise comparisons
per iteration times the number of iterations per line search times the number of line searches.
This brings the total number of calls to the repeated query subroutine to no more than

$$4 \times \frac{3}{2}\log_2\left(\frac{256L\max_k\,(f(x_k) - f(x_k + d_k\alpha^*))}{\tau^2\eta^2}\right) \times \frac{4nL}{\tau}\log\left(\frac{f(x_0) - f(x^*)}{\eta^2\,2nL^2/\tau}\right) = O\left(n\frac{L}{\tau}\log^2\left(\frac{f(x_0) - f(x^*)}{n\eta^2}\right)\right).$$

If we set $\eta = \left(\frac{\epsilon\tau}{4nL^2}\right)^{1/2}$ so that $\mathbb{E}[f(x_K) - f(x^*)] \le \epsilon$ by Theorem 7 then the total number of
requested pairwise comparisons does not exceed 

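By finding a $T > 0$ that satisfies this bound for any $\epsilon$ we see that this is equivalent to a rate of
$O\left(n\log(n/\delta)\left(\frac{n}{T}\right)^{\frac{1}{2(\kappa-1)}}\right)$ for $\kappa > 1$ and $O\left(\exp\left\{-c\sqrt{\frac{T}{n}}\right\}\right)$ for $\kappa = 1$, ignoring polylog
factors.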
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